Scanpy Preprocessing and Clustering Tutorial¶
October 20th, 2025
Importing libraries and data¶
This data comes from the 10x Multiome Gene Expression and Chromatin Accessibility kit, applied to bone marrow mononuclear cells.
# Core scverse libraries
import scanpy as sc
import anndata as ad
import hisepy
samples = {
    "s1d1": "8c4c7e50-4888-4d98-901b-b9356056d1b8",
    "s1d3": "4d420fd6-c2a3-4d96-a6cf-1a2d717c95f4",
}
adatas = {}
for sample_id, uuid in samples.items():
    path = hisepy.cache_files([uuid])[0]
    sample_adata = sc.read_10x_h5(path)
    sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata
adata = ad.concat(adatas, label="sample")
adata.obs_names_make_unique()
print(adata.obs["sample"].value_counts())
adata
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
sample
s1d1    8785
s1d3    8340
Name: count, dtype: int64
/home/workspace/environment/pythonscrna12/lib/python3.13/site-packages/anndata/_core/anndata.py:1791: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 17125 × 36601
obs: 'sample'
The data contains 8785 cells from the first sample (s1d1) and 8340 cells from the second (s1d3), with 36,601 measured genes.
The warnings are about variable names not being unique; since `.var_names_make_unique()` is called on each sample, I don't think this will be an issue later, but I am taking note of it.
Quality Control¶
calculate_qc_metrics() is a scanpy function that computes the common quality-control metrics. It is based on calculateQCMetrics from the scater R package.
# each gene is defined by distinct prefixes below
# mitochondrial genes, "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
)
Violin plots are used to inspect the QC metrics.
The three plots, from left to right, represent:
- the number of genes expressed in the count matrix
- the total number of counts per cell
- the percentage of counts in the mitochondrial genes.
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
Along with violin plots, a scatter plot is also a good visualization for inspecting the QC metrics.
sc.pl.scatter(
    adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt"
)
From the violin plots and the scatter plot, I can remove cells that have too high a fraction of mitochondrial counts OR too many total counts.
I can do this by setting a manual or an automatic threshold. Sometimes a poor QC metric is driven by real biology, so be cautious when filtering, and revisit these choices later on.
With this in mind, the tutorial filters out cells with fewer than 100 genes expressed and genes detected in fewer than 3 cells.
ALSO: for datasets with multiple batches, quality control should be performed on each sample individually, as QC thresholds can vary substantially between batches.
# filtering for cells that express at least 100 genes
sc.pp.filter_cells(adata, min_genes=100)
# filtering for genes that are detected in at least 3 cells
sc.pp.filter_genes(adata, min_cells=3)
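The "automatic threshold" idea mentioned above is commonly implemented as a median-absolute-deviation (MAD) rule applied per sample: flag cells whose QC metric lies more than a few MADs from the sample median. A minimal sketch in pure NumPy (the `is_outlier` helper and the `n_mads` default are hypothetical names, not scanpy API):

```python
import numpy as np

def is_outlier(values, n_mads=5):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > n_mads * mad

# Toy per-cell total counts: five typical cells, one huge, one near-empty
total_counts = np.array([1900, 2000, 2100, 2050, 1950, 60000, 5])
print(is_outlier(total_counts))  # [False False False False False  True  True]
```

On a real AnnData this mask would be computed per sample (e.g. inside a groupby over `adata.obs["sample"]`) and then used to subset the cells.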
Doublet detection¶
A doublet (two cells captured together as one "cell") can lead to misclassification or distortion in later analysis steps. Scanpy has its own doublet detection: scanpy.pp.scrublet()!
This function predicts cell doublets using nearest-neighbor classification of observed transcriptomes and simulated doublets. It adds doublet_score and predicted_doublet to .obs.
You can filter directly using predicted_doublet.
You can use doublet_score to filter later during the clustering step where you filter out clusters with high doublet scores.
sc.pp.scrublet(adata, batch_key="sample")
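Filtering on predicted_doublet is then just a boolean subset. A toy sketch using a pandas DataFrame as a stand-in for .obs (the values are made up; the column names match what scrublet adds):

```python
import pandas as pd

# Hypothetical slice of .obs after sc.pp.scrublet()
obs = pd.DataFrame({
    "doublet_score": [0.02, 0.65, 0.10, 0.70],
    "predicted_doublet": [False, True, False, True],
})

# Keep only predicted singlets; on a real AnnData this would be
# adata = adata[~adata.obs["predicted_doublet"]].copy()
singlets = obs[~obs["predicted_doublet"]]
print(len(singlets))  # 2
```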
Normalization¶
Next we normalize the data. A common approach is count-depth scaling followed by a log-plus-one (log1p) transformation. scanpy.pp.normalize_total
Count-depth scaling normalizes each cell to a "size factor" (examples: the median count depth in the dataset, ten thousand counts (CP10k), one million counts (CPM), etc.).
target_sum is the parameter that controls the size factor used for count-depth scaling.
# Saving count data
adata.layers["counts"] = adata.X.copy()
# median count depth normalization with log1p transformation (log1PF)
# Normalizing to median total counts
sc.pp.normalize_total(adata)
# Logarithmize the data
sc.pp.log1p(adata)
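Numerically, these two calls amount to scaling each cell to the median count depth and then applying log1p. A pure-NumPy sketch of what happens (toy matrix, not real data):

```python
import numpy as np

# Toy counts: 3 cells x 4 genes
X = np.array([
    [10, 0, 5, 5],    # count depth 20
    [40, 0, 20, 20],  # count depth 80
    [5, 5, 5, 5],     # count depth 20
], dtype=float)

depths = X.sum(axis=1)
target = np.median(depths)              # default size factor: median count depth
X_norm = X / depths[:, None] * target   # count-depth scaling
X_log = np.log1p(X_norm)                # log1p transform

# Cells 0 and 1 have the same composition at different depths,
# so they become identical after normalization
print(np.allclose(X_log[0], X_log[1]))  # True
```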
Feature Selection¶
We need to reduce the dimensionality of the dataset to only the most informative genes! scanpy.pp.highly_variable_genes
This function annotates highly variable genes, reproducing the implementation of the chosen flavor: Seurat, Cell Ranger, or Seurat v3.
# getting the top 2000 genes
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, batch_key="sample"
)
# plotting!
sc.pl.highly_variable_genes(adata)
Dimensionality Reduction¶
Reduce the data's dimensionality by running principal component analysis (PCA). scanpy.tl.pca()
PCA reveals the main axes of variation and de-noises the data.
sc.tl.pca(adata)
Plot the variance ratio of the principal components to inspect the contribution of each single PC to the total variance in the data.
This plot tells us how many PCs should be considered when computing the neighborhood relations of cells (used in the clustering function leiden() or in tsne()).
There doesn't seem to be a significant downside to overestimating the number of principal components, so to be safe, overestimate!
# plotting the pca variance ratio
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
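The variance ratio plotted above is simply each PC's share of the total variance. A self-contained NumPy sketch on synthetic data (the toy matrix is rank-2 plus small noise, so the first two PCs dominate):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 "cells" x 10 "features" with essentially 2 directions of variation
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) \
    + 0.05 * rng.normal(size=(200, 10))

Xc = X - X.mean(axis=0)                   # center the data
s = np.linalg.svd(Xc, compute_uv=False)   # singular values
variance_ratio = s**2 / np.sum(s**2)      # per-PC share of total variance

# Nearly all variance falls on the first two PCs in this toy example
print(variance_ratio[:2].sum() > 0.95)  # True
```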
Plotting the PCs to check for potentially undesired features that drive significant variation in the dataset (such as batch or QC metrics).
# plotting each sample and the PCs
sc.pl.pca(
    adata,
    color=["sample", "sample", "pct_counts_mt", "pct_counts_mt"],
    dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
    ncols=2,
    size=2,
)
Nearest Neighbor Graph Construction and Visualization¶
Using the PCA representation of the data, we compute the neighborhood graph of cells. scanpy.pp.neighbors()
# computing the neighborhood graph of cells
sc.pp.neighbors(adata)
Using scanpy.tl.umap(), the graph can be embedded in two dimensions for visualization with UMAP.
# embedding the graph into 2 dimensions
sc.tl.umap(adata)
# plotting the UMAP
sc.pl.umap(
    adata,
    color="sample",
    # Setting a smaller point size to prevent overlap
    size=2,
)
There is only a minor batch effect even though the data contains two different samples, so we continue with the next steps (clustering and annotation).
If the UMAP shows a large batch effect, it can be beneficial to integrate across samples and perform batch correction/integration. scanorama and scvi-tools are useful for batch integration.
Clustering¶
Scanpy recommends the Leiden graph-clustering method. scanpy.tl.leiden()
This method directly clusters the neighborhood graph of cells, which was already computed in the nearest neighbor graph construction step.
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
sc.tl.leiden(adata, flavor="igraph", n_iterations=2)
# plotting the UMAP
sc.pl.umap(adata, color=["leiden"])
Re-assessing Quality Control and Cell Filtering¶
As stated in the Quality Control step, we need to revisit the filtering step. Visualizing different QC metrics using UMAP can be helpful to re-assess.
sc.pl.umap(
    adata,
    color=["leiden", "predicted_doublet", "doublet_score"],
    # increase horizontal space between panels
    wspace=0.5,
    size=3,
)
sc.pl.umap(
    adata,
    color=["leiden", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts"],
    wspace=0.5,
    ncols=2,
)
Manual Cell-type Annotation¶
At this point we have a set of good-quality cells and can move on to annotating them with their known cell types.
Genes that are exclusively expressed by a certain cell type (that type's marker genes) are used to distinguish the heterogeneous groups of cells in the data; during annotation, these marker genes link clusters to cell types.
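The logic of marker-based annotation can be sketched as a toy scoring scheme: score each cluster by the mean expression of each candidate type's markers and assign the best-scoring type. All gene names and values below are made up for illustration:

```python
import numpy as np

# Toy mean-expression matrix: 2 clusters x 3 genes
genes = ["GNLY", "NKG7", "MS4A1"]
cluster_means = np.array([
    [5.0, 4.0, 0.1],  # cluster 0: high NK markers
    [0.2, 0.1, 6.0],  # cluster 1: high B-cell marker
])

# Candidate cell types and their marker genes
markers = {"NK": ["GNLY", "NKG7"], "B": ["MS4A1"]}

gene_idx = {g: i for i, g in enumerate(genes)}
labels = []
for c in range(cluster_means.shape[0]):
    # mean marker expression per candidate type, then argmax
    scores = {t: cluster_means[c, [gene_idx[g] for g in gs]].mean()
              for t, gs in markers.items()}
    labels.append(max(scores, key=scores.get))
print(labels)  # ['NK', 'B']
```

Real annotation also weighs negative markers, cluster context, and prior knowledge, which is why it remains a manual step here.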
# using the Leiden clustering algorithm
# (extracts cell communities from the nearest-neighbor graph)
for res in [0.02, 0.5, 2.0]:
    sc.tl.leiden(
        adata, key_added=f"leiden_res_{res:4.2f}", resolution=res, flavor="igraph"
    )
resolution is a parameter that controls the coarseness of the clustering: higher resolutions yield more clusters. The meaningful number of clusters is bounded by the stable, biologically meaningful groups that can ultimately be distinguished.
sc.pl.umap(
    adata,
    color=["leiden_res_0.02", "leiden_res_0.50", "leiden_res_2.00"],
    legend_loc="on data",
)
UMAPs should not be over-interpreted!
Here, the highest resolution over-clusters the data, while the lowest resolution is likely grouping cells that belong to distinct cell identities.
Marker Gene Set¶
Here, we define a set of marker genes for the main cell types that we expect to see in the dataset.
marker_genes = {
    "CD14+ Mono": ["FCN1", "CD14"],
    "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
    # Note: DMXL2 should be negative
    "cDC2": ["CST3", "COTL1", "LYZ", "DMXL2", "CLEC10A", "FCER1A"],
    "Erythroblast": ["MKI67", "HBA1", "HBB"],
    # Note: HBM and GYPA are negative markers
    "Proerythroblast": ["CDK6", "SYNGR1", "HBM", "GYPA"],
    "NK": ["GNLY", "NKG7", "CD247", "FCER1G", "TYROBP", "KLRG1", "FCGR3A"],
    "ILC": ["ID2", "PLCG2", "GNLY", "SYNE1"],
    "Naive CD20+ B": ["MS4A1", "IL4R", "IGHD", "FCRL1", "IGHM"],
    # Note: IGHD and IGHM are negative markers
    "B cells": [
        "MS4A1",
        "ITGB1",
        "COL4A4",
        "PRDM1",
        "IRF4",
        "PAX5",
        "BCL11A",
        "BLK",
        "IGHD",
        "IGHM",
    ],
    "Plasma cells": ["MZB1", "HSP90B1", "FNDC3B", "PRDM1", "IGKC", "JCHAIN"],
    # Note: PAX5 is a negative marker
    "Plasmablast": ["XBP1", "PRDM1", "PAX5"],
    "CD4+ T": ["CD4", "IL7R", "TRBC2"],
    "CD8+ T": ["CD8A", "CD8B", "GZMK", "GZMA", "CCL5", "GZMB", "GZMH"],
    "T naive": ["LEF1", "CCR7", "TCF7"],
    "pDC": ["GZMB", "IL3RA", "COBLL1", "TCF4"],
}
# plotting a dotplot of the marker genes
sc.pl.dotplot(
    adata, marker_genes, groupby="leiden_res_0.02", standard_scale="var"
)
Looking at the dotplot of the marker genes, there are pretty clear patterns of expression. We can use this information to label the coarsest clustering with broad lineages.
adata.obs["cell_type_lvl1"] = adata.obs["leiden_res_0.02"].map(
    {
        "0": "Lymphocytes",
        "1": "Monocytes",
        "2": "Erythroid",
        "3": "B Cells",
    }
)
# plotting the dotplot of the marker genes
sc.pl.dotplot(
    adata, marker_genes, groupby="leiden_res_0.50", standard_scale="var"
)
Looking at this dotplot, this resolution is suitable for distinguishing most of the different cell types in the data.
Now, let's annotate these manually using the dotplot above and the UMAP of the clusters!
Ideally, one would look at each cluster specifically and subcluster it if required.
Differentially Expressed Genes as Markers¶
Using the Wilcoxon rank-sum test (or a t-test) to rank marker genes per cluster, we can check whether the top genes link to any known biology (cell types and/or states).
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(
    adata, groupby="leiden_res_0.50", method="wilcoxon"
)
# plotting a dotplot of the top 5 differentially-expressed genes
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res_0.50", standard_scale="var", n_genes=5
)
WARNING: dendrogram data not found (using key=dendrogram_leiden_res_0.50). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
Using these genes we can figure out what cell types we are looking at.
EXAMPLE: Cluster 7 expressing NKG7 and GNLY would suggest that these are NK cells.
scanpy.get.rank_genes_groups_df() can be used to extract differentially expressed genes in a convenient format. This allows you to create your own plots in a more automated way.
# ranking the genes in group 7
sc.get.rank_genes_groups_df(adata, group="7").head(5)
|   | names | scores | logfoldchanges | pvals | pvals_adj |
|---|---|---|---|---|---|
| 0 | CD79B | 32.199940 | 5.769280 | 1.768379e-227 | 4.142783e-223 |
| 1 | TCL1A | 30.840637 | 6.090436 | 7.479323e-209 | 8.760904e-205 |
| 2 | CD24 | 30.047489 | 5.935766 | 2.354551e-198 | 1.838669e-194 |
| 3 | SOX4 | 29.995842 | 5.294852 | 1.111875e-197 | 6.511976e-194 |
| 4 | IGHM | 29.938829 | 5.333653 | 6.150125e-197 | 2.881580e-193 |
dc_cluster_genes = sc.get.rank_genes_groups_df(adata, group="7").head(5)["names"]
sc.pl.umap(
    adata,
    color=[*dc_cluster_genes, "leiden_res_0.50"],
    legend_loc="on data",
    frameon=False,
    ncols=3,
)
The p-values are extremely low here because the statistical test being performed treats each cell as an independent sample.
For a more conservative approach, consider "pseudo-bulking" the data by sample: sc.get.aggregate(adata, by=["sample", "cell_type"], func="sum", layer="counts").
Also, consider using a more powerful differential expression tool such as pydeseq2.
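The pseudo-bulking idea can be sketched with a plain pandas groupby: summing counts within each (sample, cell type) pair yields one pseudo-replicate per sample, which is what sc.get.aggregate(..., func="sum") produces. The cell values and labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-cell counts with sample and cell-type labels
cells = pd.DataFrame({
    "sample":    ["s1d1", "s1d1", "s1d1", "s1d3", "s1d3"],
    "cell_type": ["NK",   "NK",   "B",    "NK",   "B"],
    "GNLY":      [5,      3,      0,      4,      1],
    "MS4A1":     [0,      0,      7,      0,      6],
})

# One row per (sample, cell_type): cells are no longer treated as independent samples
pseudobulk = cells.groupby(["sample", "cell_type"]).sum()
print(pseudobulk)
```

The resulting table can then be fed to a count-based DE tool such as pydeseq2, with `sample` as the replicate unit.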